Robust Lasso-Zero for sparse corruption and model selection with missing covariates
We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology
[Descloux and Sardy, 2018], initially introduced for sparse linear models, to
the sparse corruptions problem. We give theoretical guarantees on the sign
recovery of the parameters for a slightly simplified version of the estimator,
called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased
for variable selection with missing values in the covariates. In addition to
not requiring the specification of a model for the covariates, nor estimating
their covariance matrix or the noise variance, the method has the great
advantage of handling missing not-at-random values without specifying a
parametric model. Numerical experiments and a medical application underline the
relevance of Robust Lasso-Zero in such a context with few available
competitors. The method is easy to use and implemented in the R library lass0.
Missing Data Imputation using Optimal Transport
Missing data is a crucial issue when applying machine learning algorithms to
real-world datasets. Starting from the simple assumption that two batches
extracted randomly from the same dataset should share the same distribution, we
leverage optimal transport distances to quantify that criterion and turn it
into a loss function to impute missing data values. We propose practical
methods to minimize these losses using end-to-end learning, that can exploit or
not parametric assumptions on the underlying distributions of values. We
evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR
settings. These experiments show that OT-based methods match or outperform
state-of-the-art imputation methods, even for high percentages of missing
values.
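The batch-comparison criterion above can be made concrete with a small NumPy sketch of the entropy-regularized OT (Sinkhorn) cost between two batches. This is an editorial illustration, not the authors' implementation; the batch sizes, cost matrix, and regularization strength `eps` are illustrative choices:

```python
import numpy as np

def sinkhorn_distance(X, Y, eps=1.0, n_iter=200):
    """Entropy-regularized OT cost between two batches with uniform weights."""
    n, m = X.shape[0], Y.shape[0]
    # Pairwise squared-Euclidean cost matrix.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                    # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                 # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]         # (approximate) transport plan
    return (P * C).sum()

rng = np.random.default_rng(0)
batch1 = rng.normal(0.0, 1.0, size=(64, 3))
batch2 = rng.normal(0.0, 1.0, size=(64, 3))   # same distribution as batch1
batch3 = rng.normal(2.0, 1.0, size=(64, 3))   # shifted, e.g. bad imputations

# Two batches drawn from the same distribution are close in OT cost,
# while a shifted batch is far; imputation then amounts to minimizing
# such a loss with respect to the imputed entries.
d_same = sinkhorn_distance(batch1, batch2)
d_diff = sinkhorn_distance(batch1, batch3)
```

In the paper this cost (or a Sinkhorn divergence) is turned into a differentiable loss and minimized end-to-end over the missing entries; the sketch only shows the distributional criterion itself.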
Natural Antisense Transcripts: Molecular Mechanisms and Implications in Breast Cancers.
Natural antisense transcripts are RNA sequences that can be transcribed from both DNA strands at the same locus but in the opposite direction from the gene transcript. Because strand-specific high-throughput sequencing of the antisense transcriptome has only been available for less than a decade, many natural antisense transcripts were first described as long non-coding RNAs. Although the precise biological roles of natural antisense transcripts are not yet known, an increasing number of studies report their involvement in gene expression regulation. Their expression levels are altered in many physiological and pathological conditions, including breast cancers. Among the potential clinical utilities of natural antisense transcripts, the non-coding/coding transcript pairs are of high interest for treatment. Indeed, these pairs can be targeted by antisense oligonucleotides to specifically tune the expression of the coding gene. Here, we describe the current knowledge about natural antisense transcripts, their varied molecular mechanisms as gene expression regulators, and their potential as prognostic or predictive biomarkers in breast cancers.
Model-based Clustering with Missing Not At Random Data
Traditional ways of handling missing values are not designed for clustering,
and they rarely apply to the general case, though frequent in practice, of
Missing Not At Random (MNAR) values. This paper proposes to
embed MNAR data directly within model-based clustering algorithms. We introduce
a mixture model for different types of data (continuous, count, categorical and
mixed) to jointly model the data distribution and the MNAR mechanism. Eight
different MNAR models are proposed, which may depend on the underlying
(unknown) classes and/or the values of the missing variables themselves. We
prove the identifiability of the parameters of both the data distribution and
the mechanism, whatever the type of data and the mechanism, and propose an EM
or Stochastic EM algorithm to estimate them. The code is available on
\url{https://github.com/AudeSportisse/Clustering-MNAR}.
We also prove
that MNAR models for which the missingness depends on the class membership have
the nice property that the statistical inference can be carried out on the data
matrix concatenated with the mask by considering a MAR mechanism instead.
Finally, we perform empirical evaluations for the proposed sub-models on
synthetic data and we illustrate the relevance of our method on a medical
register, the TraumaBase® dataset.
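The class-membership-dependent MNAR property mentioned above suggests a simple recipe that a minimal sketch can illustrate: cluster the imputed data matrix concatenated with its binary mask, treating the mask columns as ordinary (MAR) features. The plain k-means below, the zero imputation, and the simulated missingness pattern are all illustrative stand-ins for the paper's (Stochastic) EM machinery:

```python
import numpy as np

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain k-means, used here as a generic clustering stand-in."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = ((Z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        # Recompute centers, keeping the old one if a cluster empties.
        centers = np.stack([Z[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(1)
# Two classes; in class 1 the second variable is often missing,
# i.e. missingness depends on the (unknown) class membership.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(3.0, 1.0, (100, 2))])
mask = np.zeros_like(X)
mask[100:, 1] = (rng.random(100) < 0.7).astype(float)  # 1 = missing
X_obs = np.where(mask == 1, 0.0, X)                    # zero imputation

# Cluster [data | mask]: the mask itself carries class information,
# so concatenating it helps rather than hurts.
labels = kmeans(np.hstack([X_obs, mask]), k=2)
```

The point of the sketch is only the concatenation step; the paper's result is that, for these class-dependent MNAR models, inference on the concatenated matrix under a MAR assumption is statistically justified.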
Debiasing Stochastic Gradient Descent to handle missing values
Stochastic gradient algorithms are a key ingredient of many machine learning methods and are particularly appropriate for large-scale learning. However, a major caveat of large data is their incompleteness. We propose an averaged stochastic gradient algorithm handling missing values in linear models. This approach has the merit of being free of any data distribution modeling and of accounting for heterogeneous missing proportions. In both streaming and finite-sample settings, we prove that this algorithm achieves a convergence rate of O(1/n) at iteration n, the same as without missing values. We show the convergence behavior and the relevance of the algorithm not only on synthetic data but also on real data sets, including those collected from a medical register.
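A minimal NumPy sketch of the debiasing idea for least squares, under simplifying assumptions not stated in the abstract (homogeneous MCAR with a known observation probability p, entries zero-imputed and rescaled by 1/p): the rescaled covariate is unbiased for x, but its outer product over-weights the diagonal, and a correction term removes that bias before the averaged SGD update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 20000, 5, 0.7                  # samples, features, obs. probability
beta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
X = rng.normal(size=(n, d))
y = X @ beta_true + 0.1 * rng.normal(size=n)

delta = rng.random((n, d)) < p           # observation mask (True = observed)
X_tilde = np.where(delta, X, 0.0) / p    # rescaled imputation: E[X_tilde] = X

beta = np.zeros(d)
beta_avg = np.zeros(d)
step = 0.01
for i in range(n):
    x = X_tilde[i]
    # Debiased gradient of 0.5 * (x^T beta - y)^2: since E[x_j^2] = X_ij^2 / p,
    # the (1 - p) * x**2 term cancels the inflated diagonal of x x^T.
    grad = x * (x @ beta) - (1 - p) * (x ** 2) * beta - y[i] * x
    beta -= step * grad
    beta_avg += (beta - beta_avg) / (i + 1)   # Polyak-Ruppert averaging
```

Without the correction term, the iterates converge to a shrunken version of the true coefficients; with it, `beta_avg` approaches `beta_true`. The constants (step size, p) are illustrative, and the paper handles the more general heterogeneous and streaming settings.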
High strength-high conductivity carbon nanotube-copper wires with bimodal grain size distribution by spark plasma sintering and wire-drawing
Copper and 1 vol% carbon nanotube-copper cylinders with a micrometric copper grain size and either a unimodal or a bimodal grain size distribution were prepared using spark plasma sintering. The cylinders served as starting materials for room-temperature wire-drawing, enabling the preparation of conducting wires with ultrafine grains. The tensile strength of the carbon nanotube-copper wires is higher than that of the corresponding pure copper wires. We show that the bimodal grain size distribution favors strengthening while limiting the increase in electrical resistivity of the wires, both for pure copper and for the composites.
Breast cancer: from targeted therapy to personalized medicine
In this article, the authors review the main principles of systemic treatment of breast cancer and ask the following question: to what extent is this treatment truly individualized today? New technologies allow a detailed analysis of genomic abnormalities in cancer cells. Unfortunately, we do not yet understand how best to use these data for the benefit of the patient. The majority of genome alterations are relatively rare events, which complicates the development of new drugs within the framework of precision medicine. Moreover, tumors show considerable temporal and spatial heterogeneity, which will have to be taken into account during this development. An intensive international collaboration is under way to try to confirm that precision medicine can optimize the results of systemic treatment in breast cancer.
Normalization and correction for batch effects via RUV for RNA-seq data: practical implications for Breast Cancer Research
The whole transcriptome contains information about nonsense, missense, silent, in-frame and frameshift mutations, as observed at the whole-exome level, as well as splicing and (allelic) gene-expression changes that are missed by DNA analysis. One important step in the analysis of gene expression data arising from RNA-seq is the detection of differential expression (DE) levels. Several methods are available and the choice is sometimes controversial. For a reliable DE analysis that reduces false-positive DE genes, and for accurate estimation of gene expression levels, a good and suitable normalization approach (including correction for confounders) is mandatory. Several normalization methods have been proposed to correct for both within-sample and between-sample biases. RUV (Removing Unwanted Variation) is one of them and has the advantage of correcting for batch effects, including potentially unknown unwanted variation in gene expression. In this study, we present a comparison, on real-life Illumina paired-end sequencing data for estrogen-receptor-positive (ER+) breast cancer tissues versus matched controls, between RUV (RUVg using in silico negative control genes) and more commonly used methods for RNA-seq data normalization, such as DESeq2, edgeR, and UQ. The set of in silico empirical negative control genes for RUVg was defined as the set of least significant DE genes obtained from a first DE analysis performed prior to RUVg correction. Box plots of relative log expression (RLE) among the samples and PCA plots show that RUVg performs well and leads to a stabilization of read counts across samples, with a clear clustering of biological replicates.
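The RLE diagnostic used above is simple to state: take log-counts, subtract each gene's median across samples, and inspect the per-sample distributions, which should be centered at zero after good normalization. A minimal NumPy sketch on toy counts (this is only the diagnostic, not the RUVg procedure, and the simulated batch effect is illustrative):

```python
import numpy as np

def rle(counts, pseudo=1.0):
    """Relative log expression: log2 counts minus the gene-wise median."""
    log_c = np.log2(counts + pseudo)  # pseudo-count avoids log(0)
    # Subtract, per gene (row), the median across samples (columns).
    return log_c - np.median(log_c, axis=1, keepdims=True)

rng = np.random.default_rng(0)
counts = rng.poisson(100, size=(500, 6)).astype(float)  # genes x samples
counts[:, 3:] *= 2.0   # crude library-size/batch effect on samples 4-6

r = rle(counts)
medians = np.median(r, axis=0)  # per-sample RLE medians
# Boxplots of the columns of r are the usual visual check: the affected
# samples show a clear positive offset before normalization.
```

After an effective correction (RUVg or otherwise), the per-sample RLE medians should all sit near zero.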
Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery
Biomarker signature discovery remains the main path to developing clinical diagnostic tools when the biological knowledge of a pathology is weak. The shortest signatures are often preferred to reduce the cost of the diagnostic. The ability to find the best and shortest signature relies on the robustness of the models that can be built on such a set of molecules. The classification algorithm to be used is selected based on the average performance of its models, often expressed via the average AUC. However, it is not guaranteed that an algorithm with a large average AUC will keep a stable performance when facing new data. Here, we propose two AUC-derived hyper-stability scores, the HRS and the HSS, as complementary metrics to the average AUC, which should bring confidence in the choice of the best classification algorithm. To emphasize the importance of these scores, we compared 15 different Random Forest implementations. Additionally, the modeling time of each implementation was computed to further help in deciding the best strategy. Our findings show that the Random Forest implementation should be chosen according to the data at hand and the classification question being evaluated. No Random Forest implementation can be used universally for any classification task and on any dataset. Each of them should be tested for both its average AUC performance and AUC-derived stability prior to analysis. Author summary: To better measure the performance of a Machine Learning (ML) implementation, we introduce a new metric, the AUC hyper-stability, to be used in parallel with the average AUC. The AUC hyper-stability is able to discriminate between ML implementations that show the same AUC performance. This metric can therefore help researchers choose the best ML method to obtain stable, short predictive biomarker signatures. More specifically, we advocate a tradeoff between the average AUC performance, the hyper-stability scores, and the modeling time.
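The exact HRS and HSS scores are defined in the paper; the underlying idea, however, can be sketched generically: evaluate a classifier's AUC over many random resplits of the data and look at the dispersion of the AUCs, not just their mean. The nearest-centroid classifier, split scheme, and simulated "biomarkers" below are placeholder choices for illustration:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formula, assuming no ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
ylab = (rng.random(n) < 0.5).astype(int)
X[ylab == 1, :3] += 1.0          # 3 informative "biomarker" features

aucs = []
for _ in range(50):              # repeated random 50/50 resplits
    idx = rng.permutation(n)
    tr, te = idx[:100], idx[100:]
    # Nearest-centroid score: project onto the between-class mean difference.
    w = X[tr][ylab[tr] == 1].mean(0) - X[tr][ylab[tr] == 0].mean(0)
    aucs.append(auc(X[te] @ w, ylab[te]))

aucs = np.array(aucs)
mean_auc, dispersion = aucs.mean(), aucs.std()
# Two methods with the same mean_auc can differ widely in dispersion;
# the hyper-stability scores formalize that second axis of comparison.
```

A stability-aware choice then prefers the implementation with both a high mean AUC and a low dispersion across resplits, which is the tradeoff the paper advocates (together with modeling time).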